
sd: sync to master-591-331cfa5 #2155

Merged
LostRuins merged 3 commits into LostRuins:concedo_experimental from wbruna:kcpp_sd_update_202604_3
May 1, 2026

Conversation


@wbruna wbruna commented Apr 22, 2026

As I've mentioned back in #2150, there was a big refactor of the model I/O code.

It touched several sd_get_u8path calls. I've moved them to a wrapper around the top-level ModelLoader::init_from_file; I think it even covers more code paths now. But I have no way to test it on Windows (note the call inside ModelLoader::load_tensors isn't needed because it uses a previously-stored path).

The safetensors reading changed, and no longer applies the tensor prefix together with the file reading. I've moved the kcpp_fix_wrong_img_tensor_name call to what looks like the right place, but that needs to be tested with an affected .safetensors file.

The main issue is that the source includes in sdtype_adapter.cpp only worked after pretty heavy gymnastics (internal function name collisions). We need to seriously consider a more "normal" approach for building: .cpp -> .o -> link, an intermediate static library, or just passing the sources directly on the command line.

There is also a scheduler change that went missing from the last sync.

@wbruna wbruna force-pushed the kcpp_sd_update_202604_3 branch from 3a84f93 to 74cd645 on April 30, 2026 00:29
@wbruna wbruna changed the title from sd: sync to master-585-44cca3d to sd: sync to master-593-3d6064b on Apr 30, 2026

wbruna commented Apr 30, 2026

Updated to master-587-b8bdffc, then master-593-3d6064b.

We got: support for High-Res Fix (which shouldn't change anything until we add API and UI support on our end), structured metadata on generated images (doesn't matter for us, although we could consider a similar approach), a memory-handling fix for image-to-image with tiling, and the big one: removal of all build-time backend dependencies from the sd.cpp sources.

The backend change conflicts with several ggml includes: I had to remove util.h from sdtypes_adapter.cpp because it triggered weird errors in llama.h. A better solution could be splitting util.h into two headers, but I didn't want to change anything outside otherarch/sdcpp in the middle of this PR. I also had to adapt our main_gpu support, but didn't test it very thoroughly.

On the other hand, since sd.cpp became backend-agnostic, sdtypes_adapter.cpp now only needs to be built once for all backends. Both Vulkan and ROCm seem to be running fine with the change.


LostRuins commented Apr 30, 2026

oof, the backend change does not look very pleasant, even though I can see why you added it.

How exactly is backend determination supposed to be done now? Because currently, when selecting a desired backend (via compile flags for that target), the selected backend is expected to be used for everything (e.g. if I pick CPU, even though I might have CUDA or Vulkan available, I expect CPU to be used for TTS, LLM, and image gen).

I do not want automatic backend determination from the C++ side; kcpp backend selection should always be explicit (which means it must use pure CPU, or pure Vulkan with a specified device if the user chooses it, even if I launched the CUDA build with CUDA available and supported). I think since ggml is built with the expected flags this should be doable, but I'm very unclear on how this auto-selection works right now.

Anyway, to make this easier, can we do a merge before master-593 first while we figure out a good solution?


wbruna commented Apr 30, 2026

(I'll reply about the backend issue later)

Anyway, to make this easier, can we do a merge before master-593 first while we figure out a good solution?

Sure, I'll do another merge with master-591-331cfa5 for the memory fix.

@wbruna wbruna force-pushed the kcpp_sd_update_202604_3 branch from 74cd645 to d4dba49 on April 30, 2026 10:10

wbruna commented Apr 30, 2026

Done.

How exactly is backend determination supposed to be done now?

We first pick the device name with this:

__STATIC_INLINE__ std::string get_default_backend_name() {
    ggml_backend_load_all_once();
    // should pick the same backend as ggml_backend_init_best
    ggml_backend_dev_t dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_GPU);
    dev                    = dev ? dev : ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_IGPU);
    dev                    = dev ? dev : ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
    if (dev == nullptr) {
        return "";
    }
    return ggml_backend_dev_name(dev);
}

then initialize it by name.

because currently when selecting a desired backend (via compile flags for that target), the selected backend is expected to be used for everything (e.g. If i pick CPU, even though I might have cuda or vulkan, I expect to use CPU for TTS, LLM and image gen).

It still works like this: ggml only exposes two types of backend at a time (Vulkan and CPU, ROCm and CPU, etc.), so it'll prefer the first discrete GPU, then the first iGPU, and use it for everything (the --x-on-cpu flags still work the same way as before).

I do not want automatic backend determination from C++ side

In the sense of, say, Vulkan0 versus Vulkan1, I think the selection has changed: I have a lot of scripts to convince everybody that my dGPU is at a specific slot (it often switches places with the iGPU at boot 😕), and sd.cpp now seems to correctly pick the dGPU regardless of them. But note: even if the previous algorithm was just "pick the first device", technically it was already the C++ layer choosing it (unless forced with main_gpu).

We will likely get more sophisticated selection soon (like "Vulkan1 for the diffusion, Vulkan0 for the VAE, CPU for the conditioner"), but we are not there yet: it's still the same backend for everything, except for --x-on-cpu.

I made a hackish adaptation for main_gpu to keep the interface working as-is, but I believe a more future-proof approach would be:

  • we add a function on sdtype_adapter that publishes to the Python layer the list of device names, together with useful metadata (e.g. whether it is a dGPU, whether it would be the C++ layer's own choice, etc.);
  • we change the model loading interface to receive device names for the diffusion model / clip / vae;
  • sdtype_adapter passes the main device name directly to the default device selection in util.cpp (like the main_gpu adaptation I did).

This way, if the Python layer decides it wants to force a device, let C++ pick whatever, or even support fancy aliases like "iGPU", it can. In fact, we could even add support for placing models on different devices before upstream: the diff would be pretty simple at this point.

@wbruna wbruna changed the title from sd: sync to master-593-3d6064b to sd: sync to master-591-331cfa5 on Apr 30, 2026
LostRuins (Owner) commented

I cannot remember which model I originally tested that had mis-named tensors.

Tested an extended set of my usual model collections on windows, all seem to work:
sd1.5
sdxl with lora
flux
kontext
z-image turbo
WAN 2.2 14B
Qwen Image Edit
Qwen Image
Klein 4B

btw any idea why leejet keeps doing these massive refactors? kinda sad we get this instead of new model support lol

@LostRuins LostRuins merged commit e2bdd6d into LostRuins:concedo_experimental May 1, 2026

wbruna commented May 1, 2026

I cannot remember which model I originally tested that had mis-named tensors.

Tested an extended set of my usual model collections on windows, all seem to work: sd1.5, sdxl with lora, flux, kontext, z-image turbo, WAN 2.2 14B, Qwen Image Edit, Qwen Image, Klein 4B

Nice. Maybe we could send a PR upstream with the sd_get_u8path change? I would do that myself, but it's tricky to push forward a change that I can't test myself.

btw any idea why leejet keeps doing these massive refactors? kinda sad we get this instead of new model support lol

Some results make sense: e.g. the modularization in this one helped with legacy .pth support, and likely with the gguf -> safetensors conversion. I just wish they weren't so disruptive...
